CONSISTENCY OF OBJECTIVE BAYES FACTORS AS THE 
MODEL DIMENSION GROWS 

By Elías Moreno, F. Javier Girón and George Casella 
University of Granada, University of Málaga and University of Florida 

In the class of normal regression models with a finite number 
of regressors, and for a wide class of prior distributions, a Bayesian 
model selection procedure based on the Bayes factor is consistent 
[Casella and Moreno J. Amer. Statist. Assoc. 104 (2009) 1261-1271]. 
However, in models where the number of parameters increases as the 
sample size increases, properties of the Bayes factor are not totally 
understood. Here we study consistency of the Bayes factors for nested 
normal linear models when the number of regressors increases with 
the sample size. We pay attention to two successful tools for model 
selection: the Schwarz approximation to the Bayes factor [Schwarz 
Ann. Statist. 6 (1978) 461-464], and the Bayes factor for intrinsic 
priors [Berger and Pericchi J. Amer. Statist. Assoc. 91 (1996) 109-122, 
Moreno, Bertolino and Racugno J. Amer. Statist. Assoc. 93 (1998) 1451-1460]. 

We find that the Schwarz approximation and the Bayes factor 
for intrinsic priors are consistent when the rate of growth of the 
dimension of the bigger model is $O(n^b)$ for $b < 1$. When $b = 1$ the 
Schwarz approximation is always inconsistent under the alternative 
while the Bayes factor for intrinsic priors is consistent except for a 
small set of alternative models which is characterized. 

1. Introduction. Statistical methodology based on Bayes factors is par- 
ticularly suitable for dealing with multiple hypotheses testing problems when 
the dimension of the parameter spaces varies across models. In such cases, 
Bayesian and frequentist model selection procedures do not necessarily agree 
as the typically ad hoc dimension corrections of the different frequentist cri- 
teria do not provide the same results as those automatically produced by 
the Bayesian procedures which select models according to the parsimony 
principle. For a recent discussion on the topic see Girón et al. (2006). 

In the class of normal linear regression models, consistency of Bayesian 
variable selection procedures, and, in particular, those using intrinsic priors, 
has been recently established in Casella et al. (2009). There it was shown 
that, under mild regularity conditions, when sampling from a given submodel 
of a regression model with p regressors, the probability of selecting the true 
model tends to one as the sample size n tends to infinity, and the probability 
of selecting any other submodel tends to zero. It was also shown that the 
Schwarz (1978) approximation is, in spite of its simplicity, an accurate tool 
for selecting linear models when there is a small number of parameters and 
the sample size is moderate or large. Those results were obtained for a fixed 
number of regressors p, and hence a finite number of models. Other forms of 
consistency of Bayes factors for variable selection using Zellner's g-prior with 
several hyperpriors on g have been recently studied by Liang et al. (2008). 
These forms of consistency include the consistency when $R^2$ tends to one, 
when n tends to infinity, and consistency under prediction for squared error 
loss. 

However, in some applications, the number of models increases with the 
sample size. For instance, clustering is an interesting model selection problem 
where the number of models increases as the sample size increases, and the 
question is whether consistency of the Bayesian model selection procedure 
based on intrinsic priors also holds in this latter context. Certainly, it will 
not be possible to consistently estimate the parameters of the underlying 
models, but we wonder whether consistently selecting the true model is still 
possible. 

When the number of parameters increases with the sample size, an anal- 
ysis of the consistency of several frequentist and Bayesian approximation 
criteria for model selection in linear models, including the Schwarz approx- 
imation, was given in Shao (1997). However, the results obtained by Shao 
(1997) do not coincide with ours, as the consistency notion used by Shao is 
not the same as the one we use here. Shao defines a true model to be the 
submodel minimizing the average squared prediction error, and consistency 
of a model selection procedure means that the selected model converges in 
probability to this model. We consider the true model to be the one from 
which the observations are drawn. Of course, the true model may not be 
in the class of models we are considering. In this case consistency does not 
hold although many Bayesian model selection procedures choose models in 
the class that are located as close as possible to the true one where closeness 
is related to a specific "natural" metric [Casella et al. (2009)]. 

We examine consistency in linear models of both the Bayes factors for 
intrinsic priors and the Schwarz approximation (BIC), when the dimension 
of the parameter space of the models increases with the sample size. We find 
that both the Bayes factor for the intrinsic priors and BIC are consistent 
under the null; however, they might be inconsistent under some alternative 
sampling models. The consistency depends on the rate of divergence of the 
dimension of the null and the alternative linear models. Roughly speaking, 
the BIC and the Bayes factor for intrinsic priors are consistent when the 
rate of growth of the dimension of the full model goes to infinity as $O(n^b)$ 
for any $b < 1$. 

When $b = 1$, the BIC is always inconsistent under the alternative while for 
the Bayes factor for intrinsic priors there is an inconsistency region which 
is located in a small neighborhood of the null model. This neighborhood is 
characterized in terms of a "distance" to the null sampling model. In 
particular, for the case of the one-way ANOVA, where $b = 1$, the Bayes factor 
for intrinsic priors is not consistent for all alternative models. This 
finding is apparently in contradiction with the results of Berger, Ghosh and 
Mukhopadhyay (2003) who find "that suitable Bayes factors will be 
consistent," and hence induces an apparent paradox. However, in Section 4 we 
are able to resolve the apparent contradiction and find that the consistency 
result in Berger, Ghosh and Mukhopadhyay (2003) is obtained by using a 
normal prior centered at the null with variance tending to zero, a situation 
that typically is not obtained by the intrinsic priors. We also observe that 
consistency is obtained for this problem for priors that degenerate to a point 
mass. 

The rest of the paper is organized as follows. In Section 2 we characterize 
the consistency of the BIC and the Bayes factor for intrinsic priors for the 
usual linear regression model for $b < 1$, demonstrate the inconsistency of BIC 
for $b = 1$ and characterize the small inconsistency region for the Bayes factor 
for intrinsic priors for $b = 1$. Section 3 presents some models where the results 
of Section 2 apply, and Section 4 resolves the apparent paradox with the 
results of Berger, Ghosh and Mukhopadhyay (2003). Section 5 provides a 
short, concluding discussion, and there is an Appendix with some technical 
material. 

2. Consistency in linear models. In this section we give, for normal linear 
regression models with parameters increasing with the sample size, condi- 
tions under which the Bayes factor for intrinsic priors asymptotically selects 
the correct model. The finding is that the Bayes factor may not be a con- 
sistent model selector for all parameter values, depending on the rate of 
divergence of the dimension of models. When there is an inconsistency re- 
gion, it is characterized in terms of a "distance" from the alternative to the 
null model. We also show that the BIC model selector is inconsistent when 
sampling from the full model if $b = 1$. 

Let $y = (y_1, \ldots, y_n)'$ be a vector of independent responses, $X_p$ a design 
matrix of dimension $n \times p$, where $p$ is the number of explanatory 
variables, and let $X_i$ denote a submatrix of $X_p$ whose dimensions are $n \times i$. We 
compare the reduced sampling model $N_n(y | X_i\alpha_i, \sigma_i^2 I_n)$ and the full model 
$N_n(y | X_p\beta_p, \sigma_p^2 I_n)$, where the regression parameter vectors 
$\alpha_i = (\alpha_1, \ldots, \alpha_i)'$, $\beta_p = (\beta_1, \ldots, \beta_p)'$ and the error 
variances $\sigma_i^2$, $\sigma_p^2$ are unknown. Note that 
the reduced model is nested in the full model. The comparison is based on 
the Bayes factor of model $M_p$ versus model $M_i$, and we remark that it 
cannot be computed by using the reference priors, the usual objective priors, 
since they are improper and hence defined only up to an arbitrary multiplicative 
constant. The so-called intrinsic priors that are given below solve this 
difficulty [Berger and Pericchi (1996), Moreno, Bertolino and Racugno (1998)]. 
These objective priors have proven to behave very well for multiple testing 
problems [Casella and Moreno (2006)]. 

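To derive the Bayes factor for intrinsic priors we start with the improper 
reference priors $\pi^N(\alpha_i, \sigma_i) = c_i/\sigma_i$ and $\pi^N(\beta_p, \sigma_p) = c_p/\sigma_p$, where $c_i$ and $c_p$ 
are arbitrary positive constants, so we consider the following Bayesian 
models: 

$$M_i : \left\{N_n(y | X_i\alpha_i, \sigma_i^2 I_n),\ \pi^N(\alpha_i, \sigma_i) = \frac{c_i}{\sigma_i}\right\}$$

and 

$$M_p : \left\{N_n(y | X_p\beta_p, \sigma_p^2 I_n),\ \pi^N(\beta_p, \sigma_p) = \frac{c_p}{\sigma_p}\right\}.$$

Standard calculations [Moreno, Bertolino and Racugno (1998), and Girón 
et al. (2006)] yield the following intrinsic prior for $(\beta_p, \sigma_p)$, conditional on 
$(\alpha_i, \sigma_i)$: 

$$\pi^I(\beta_p, \sigma_p | \alpha_i, \sigma_i) = \frac{2\sigma_i}{\pi(\sigma_i^2 + \sigma_p^2)}\, N_p\bigl(\beta_p \,|\, \tilde\alpha_i, (\sigma_i^2 + \sigma_p^2) W^{-1}\bigr),$$

where $\tilde\alpha_i' = (\alpha_i', 0')$, $0'$ being the null vector of $p - i$ components, and 
$W^{-1} = \frac{n}{p+1}(X_p'X_p)^{-1}$. Then, using the priors $\{\pi^N(\alpha_i, \sigma_i), \pi^I(\beta_p, \sigma_p | \alpha_i, \sigma_i)\}$, the Bayes 
factor for comparing the model $M_p$ and $M_i$ is 

$$B_{pi}(y) = \frac{2(p+1)^{(p-i)/2}}{\pi}\int_0^{\pi/2} \frac{\sin^{p-i}\varphi\,\bigl(n + (p+1)\sin^2\varphi\bigr)^{(n-p)/2}}{\bigl(nB_{ip} + (p+1)\sin^2\varphi\bigr)^{(n-i)/2}}\, d\varphi, \tag{1}$$

where 

$$B_{ip} = \frac{RSS_p}{RSS_i} = \frac{y'(I_n - H_p)y}{y'(I_n - H_i)y}$$

and $H_j = X_j(X_j'X_j)^{-1}X_j'$, $j = i, p$, is the hat matrix. 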
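As a numerical aid, the Bayes factor can be evaluated on the log scale. The following Python sketch is an illustration only. It assumes the equivalent integral representation $B_{pi}(y) = \frac{2}{\pi}\int_0^{\pi/2}[1 + n/((p+1)\sin^2\varphi)]^{(n-p)/2}[1 + nB_{ip}/((p+1)\sin^2\varphi)]^{-(n-i)/2}\,d\varphi$ used in the proof of Theorem 2; the function name `log_bayes_factor`, the midpoint rule and the grid size are hypothetical choices, not part of the original derivation.

```python
import math


def log_bayes_factor(n, p, i, b_ip, steps=2000):
    """Approximate log B_pi(y) by a midpoint rule on (0, pi/2).

    n: sample size, p and i: number of regressors of the full and reduced
    models, b_ip: the statistic RSS_p / RSS_i (a number in (0, 1]).
    """
    h = (math.pi / 2) / steps
    log_terms = []
    for k in range(steps):
        phi = (k + 0.5) * h              # midpoints avoid the endpoint phi = 0
        s2 = math.sin(phi) ** 2
        log_terms.append(
            0.5 * (n - p) * math.log1p(n / ((p + 1) * s2))
            - 0.5 * (n - i) * math.log1p(n * b_ip / ((p + 1) * s2))
        )
    m = max(log_terms)                   # log-sum-exp keeps the sum stable
    total = sum(math.exp(v - m) for v in log_terms)
    return math.log(2 / math.pi) + math.log(h) + m + math.log(total)


if __name__ == "__main__":
    # No reduction in residual sum of squares favours the reduced model.
    print(log_bayes_factor(100, 5, 1, 1.0))
    # A large reduction favours the full model.
    print(log_bayes_factor(100, 5, 1, 0.3))
```

Working with logarithms matters here because the two factors of the integrand are raised to powers of order $n$ and would overflow in floating point for moderate sample sizes.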

We first extend the definition of distance from $M_p$ to $M_i$ regression models 
given in Casella et al. (2009) to account for models for which the number of 
parameters and the sample size increase to infinity and define the "distance" 
from $M_p$ to $M_i$ for a given sample size $n$ as 

$$\delta_{pi} = \frac{1}{\sigma_p^2}\,\beta_p'\,\frac{X_p'(I_n - H_i)X_p}{n}\,\beta_p.$$

The asymptotic performance of the Schwarz approximation is given in 
Theorem 1, and that of the Bayes factor for the intrinsic priors is given in 
Theorems 2 and 3, and Corollary 4. The proofs of these results depend on 
Lemma 1. 

In what follows $\lim_{n\to\infty}[M]Z_n$ will denote the limit in probability of the 
random sequence $\{Z_n; n \ge 1\}$ under the assumption that we are sampling 
from model $M$. This model $M$ will have a fixed parameter sequence. Further, 
we will need to use the doubly noncentral beta distribution with 
parameters $\nu_1/2, \nu_2/2$ and noncentrality parameters $\lambda_1, \lambda_2$. One way to define this 
distribution is as follows. If $Y_1, Y_2$ are independent random variables with 
noncentral chi square distributions $\chi^2(\nu_1, \lambda_1)$ and $\chi^2(\nu_2, \lambda_2)$, 
respectively, then the variable $X = Y_1/(Y_1 + Y_2)$ follows the doubly noncentral beta 
distribution $Be(\nu_1/2, \nu_2/2; \lambda_1, \lambda_2)$ [Johnson, Kotz and Balakrishnan (1995), 
page 502]. 



Lemma 1. 

1. When sampling from model $M_i$ the distribution of the statistic $B_{ip}$ is the 
beta distribution $Be((n - p)/2, (p - i)/2)$, and when sampling from model 
$M_p$ it is the noncentral beta distribution $Be((n - p)/2, (p - i)/2; 0, n\delta_{pi})$. 

2. Let $\{X_n, n \ge 1\}$ be a sequence of random variables such that 

$$X_n \sim Be\left(\frac{n - p}{2}, \frac{p - i}{2}; 0, n\delta_{pi}\right), \qquad n \ge 1.$$

If $i$ and $p$ vary with $n$ as $i = O(n^a)$ and $p = O(n^b)$, where $0 \le a \le b \le 1$, 
then: 

(i) If $a < b = 1$, when sampling from model $M_p$ 

$$\lim_{n\to\infty}[M_p]X_n = \frac{1 - 1/r}{\delta + 1},$$

where the constant $r$ satisfies $r = \lim_{p\to\infty} n/p > 1$, and $\delta = \lim_{n\to\infty}\delta_{pi}$. 

(ii) If $a = b = 1$, so there exist two positive constants such that $r = 
\lim_{p\to\infty} n/p > 1$ and $s = \lim_{p\to\infty} n/i > 1$, we have 

$$\lim_{n\to\infty}[M_p]X_n = \frac{1 - 1/r}{1 + \delta - 1/s}.$$

(iii) If $b < 1$, 

$$\lim_{n\to\infty}[M_p]X_n = \frac{1}{1 + \delta}.$$

Proof. See the Appendix. □ 
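The limit in part 2(i) of the lemma can be illustrated by simulation. The sketch below is only a rough Monte Carlo check under stated assumptions: the doubly noncentral beta variable is generated as $Y_1/(Y_1 + Y_2)$ with $Y_1$ a central chi-square with $n - p$ degrees of freedom and $Y_2$ a noncentral chi-square with $p - i$ degrees of freedom and noncentrality $n\delta_{pi}$ (mean equal to degrees of freedom plus noncentrality). The helper names, the seed and the number of draws are hypothetical.

```python
import math
import random


def noncentral_chi2(rng, df, lam):
    """One draw from a chi-square with df >= 1 degrees of freedom and
    noncentrality lam: a shifted squared normal plus a central chi-square."""
    shifted = rng.gauss(math.sqrt(lam), 1.0) ** 2
    rest = rng.gammavariate((df - 1) / 2.0, 2.0) if df > 1 else 0.0
    return shifted + rest


def simulate_mean_b(n, p, i, delta, draws=300, seed=0):
    """Monte Carlo mean of B_ip when sampling from the full model M_p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        w = rng.gammavariate((n - p) / 2.0, 2.0)     # central, n - p df
        v = noncentral_chi2(rng, p - i, n * delta)    # noncentral, p - i df
        total += w / (v + w)
    return total / draws


if __name__ == "__main__":
    r, delta = 2.0, 0.5
    n = 2000
    estimate = simulate_mean_b(n, int(n / r), 10, delta)
    print(estimate, (1 - 1 / r) / (1 + delta))
```

For $n = 2000$, $p = n/2$ and a small fixed $i$, the simulated mean should already sit close to the limiting value $(1 - 1/r)/(1 + \delta)$, since the variances of the two chi-square components divided by $n$ are of order $1/n$.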

2.1. Inconsistency of BIC. In this linear model setting we now prove 
that the Schwarz approximation for comparing $M_p$ against $M_i$ is inconsistent 
when sampling from $M_p$ under certain conditions, as first noticed by Stone 
(1979) in a special case. 

Theorem 1. For comparing model $M_p$ to model $M_i$, where $M_i$ is nested 
in $M_p$, and $i = O(n^a)$ and $p = O(n^b)$, if $0 \le a \le b < 1$, the Schwarz 
approximation, 

$$S_{pi}(y) = \exp\left\{\frac{i - p}{2}\log n - \frac{n}{2}\log B_{ip}(y)\right\},$$

is consistent under the null and the alternative. However, if $b = 1$ it is 
inconsistent under any alternative model $M_p$ provided that $\lim_{n\to\infty}\delta_{pi} > 0$. 

Proof. Consistency under the null for both cases follows from part 1 
of Lemma 1. For $b < 1$, we notice that the leading term of the exponent of the 
Schwarz approximation is the one involving the statistic $B_{ip}(y)$, but from part (iii) of Lemma 1 
the limit of the sequence $B_{ip}(y)$ is a number strictly smaller than 1, and, 
therefore, 

$$\lim_{n\to\infty}[M_p]S_{pi}(y) = \infty.$$

On the other hand, if $b = 1$, then $p = n/r$ and $i = n/s$, where $r$ is a positive 
number greater than 1 and $s$ is a number greater than $r$, the leading term 
of the exponent of the Schwarz approximation is now the first one, which is 
strictly negative. Therefore, 

$$\lim_{n\to\infty}[M_p]S_{pi}(y) = 0,$$

and the proof is complete. □ 

2.2. Consistency of the Bayes factors for intrinsic priors. We now 
characterize the consistency of the Bayes factor for intrinsic priors, first assuming 
that both $p$ and $n$ increase at the same rate; that is, $r = \lim_{n\to\infty, p\to\infty} n/p$ is 
a strictly positive number. We further assume that the limit of the distance 
$\delta_{pi}$ is finite when $p$ and $n$ tend to infinity, and $i$ is either finite or increases 
to infinity at a lower rate than $n$. Note that in this theorem the constant 
$b = 1$. 

Theorem 2. Suppose that, as the sample size increases, models increase 
their number of parameters with rate $i = O(n^a)$ and $p = O(n)$, where $0 \le a < 
1$, and $r = \lim_{n,p\to\infty} n/p > 1$. 

1. When sampling from the simpler model $M_i$, $\lim_{n\to\infty}[M_i]B_{pi}(y) = 0$. 

2. When sampling from the alternative model $M_p$ there exists a function 
$\delta(r)$ such that 

$$\lim_{n\to\infty}[M_p]B_{pi}(y) = \begin{cases} \infty, & \text{if } \lim_{n\to\infty}\delta_{pi} > \delta(r), \\ 0, & \text{if } \lim_{n\to\infty}\delta_{pi} < \delta(r). \end{cases} \tag{2}$$

Further, this function has the simple expression 

$$\delta(r) = \frac{r - 1}{(r + 1)^{(r-1)/r} - 1} - 1 \tag{3}$$

and is a decreasing convex function such that $\lim_{r\to\infty}\delta(r) = 0$. 
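The boundary function of the theorem is easy to tabulate. The following Python sketch is a minimal illustration that simply codes expression (3), $\delta(r) = (r-1)/((r+1)^{(r-1)/r} - 1) - 1$; the function name `delta_bound` is a hypothetical label introduced here.

```python
import math


def delta_bound(r):
    """Boundary delta(r) of expression (3); defined for r > 1."""
    if r <= 1:
        raise ValueError("delta(r) is used here only for r > 1")
    return (r - 1) / ((r + 1) ** ((r - 1) / r) - 1) - 1


if __name__ == "__main__":
    for r in (1.0001, 2, 5, 10, 50, 100):
        print(r, delta_bound(r))
    # Value approached as r decreases to 1
    print(1 / math.log(2) - 1)
```

For $r = 2$ the boundary is $1/(\sqrt{3} - 1) - 1 \approx 0.366$, and the printed values decrease toward zero as $r$ grows, in line with the statement of the theorem.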



Proof. We first prove consistency of $B_{pi}(y)$ under the simpler model 
$M_i$. The Bayes factor $B_{pi}$ in (1) can be written as 

$$B_{pi}(y) = \frac{2}{\pi}\int_0^{\pi/2}\left[1 + \frac{n}{(p+1)\sin^2\varphi}\right]^{(n-p)/2}\left[1 + \frac{nB_{ip}}{(p+1)\sin^2\varphi}\right]^{-(n-i)/2} d\varphi.$$

From Lemma 1 it follows that 

$$\lim_{p\to\infty}[M_i]B_{ip} = \frac{r - 1}{r}$$

and, replacing $n$ by $pr$, the Bayes factor for large $p$ can be approximated by 

$$B_{pi}(y) \approx \frac{2}{\pi}\int_0^{\pi/2}\left(1 + \frac{r}{\sin^2\varphi}\right)^{p(r-1)/2}\left(1 + \frac{r - 1}{\sin^2\varphi}\right)^{(i-pr)/2} d\varphi.$$

As the integrand is a monotonic increasing function of the angle $\varphi$, the sup 
is attained at $\varphi = \pi/2$, and, therefore, an upper bound on the integrand 
is $(1 + r)^{p(r-1)/2}\, r^{(i-pr)/2}$. Then, for large $p$, an upper bound for the Bayes 
factor is 

$$B_{pi}(y) \le \left[\frac{(1 + r)^{r-1}}{r^r}\right]^{p/2} r^{i/2}.$$

As the function of $r$ enclosed in square brackets is strictly smaller than 1 for 
$r > 1$, and the rate of growth of $i$ is strictly smaller than that of $p$, it follows 
that 

$$\lim_{p\to\infty}\left[\frac{(1 + r)^{r-1}}{r^r}\right]^{p/2} r^{i/2} = 0$$

for all $r > 1$, thus proving consistency of the Bayes factor for the intrinsic 
prior under the reduced model $M_i$. 

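Consistency under the full model $M_p$ is established as follows. From 
Lemma 1, the limit in probability of the statistic $B_{ip}$ under $M_p$ is 

$$\lim_{p\to\infty}[M_p]B_{ip} = \frac{1 - 1/r}{\delta + 1},$$

where $\delta$ is the limit of the "distance" from the full model to the reduced 
one, which only depends on the limiting behavior of the parameters of the 
full model; that is, 

$$\delta = \lim_{p\to\infty}\delta_{pi} = \lim_{p\to\infty}\frac{1}{\sigma_p^2}\,\beta_p'\,\frac{X_p'(I_n - H_i)X_p}{pr}\,\beta_p.$$

Therefore, the Bayes factor $B_{pi}(y)$ for large values of $p$ can be approximated 
by 

$$B_{pi}(y) \approx \frac{2}{\pi}\int_0^{\pi/2}\left(1 + \frac{r}{\sin^2\varphi}\right)^{p(r-1)/2}\left(1 + \frac{r - 1}{(1 + \delta)\sin^2\varphi}\right)^{(i-pr)/2} d\varphi.$$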

We look at two cases, depending on the values of the parameter $\delta$. 

For $\delta > 1$, the Bayes factor is an increasing convex function of $p$ and this 
implies that the Bayes factor is always consistent. 

For $\delta < 1$, the argument proceeds as follows. As the integrand is a 
continuous increasing function of $\varphi$ for all $r$, $\delta$ and $p$, then by the mean value 
theorem, there exists a unique value of $\varphi_0$, say $0 < \varphi_0(r, p, \delta) < \pi/2$, such 
that for large $p$ the Bayes factor is approximated by 

$$B_{pi}(y) \approx \left(1 + \frac{r}{\sin^2\varphi_0(r, p, \delta)}\right)^{p(r-1)/2}\left(1 + \frac{r - 1}{(1 + \delta)\sin^2\varphi_0(r, p, \delta)}\right)^{(i-pr)/2}.$$

The limit of the sequence $\{\varphi_0(r, p, \delta), p \ge 1\}$ is seen to be equal to $\pi/2$ for 
all $r$ and $\delta < 1$. Thus, for large values of $p$, recalling that $i = o(p^b)$, we can 
further approximate the Bayes factor by 

$$B_{pi}(y) \approx \left[(1 + r)^{r-1}\left(1 + \frac{r - 1}{1 + \delta}\right)^{-r}\right]^{p/2}. \tag{4}$$

(It can be checked numerically that even for moderate values of $p$ this 
approximation is very accurate.) Note that when the expression in square 
brackets is greater than 1 consistency holds, and when smaller than 1 the 
Bayes factor is inconsistent. The root of the equation 

$$(1 + r)^{r-1}\left(1 + \frac{r - 1}{1 + \delta}\right)^{-r} = 1$$

is $\delta(r)$ of (3), proving the theorem. □ 






We remark that the function $\delta(r)$ depends only on $r = \lim n/p$. In 
addition to the limiting value as $r \to \infty$, we also have $\lim_{r\to 0}\delta(r) = (e - 1)^{-1}$ 
and $\lim_{r\to 1}\delta(r) = [\log 2]^{-1} - 1$. Notice that the case of equality in the limit 
(2) is not covered by the theorem. It happens that, in this case, we cannot 
make a specific conclusion as there will be parameter values for which there 
is, and there is not, consistency. 

Theorem 2 covers the case in which the dimension of the null parameter space 
grows at a rate strictly smaller than that of the sample size. However, it 
does not cover the case where the dimensions of the null and the alternative 
spaces grow at the same rate as the sample size. This case is covered in 
Theorem 3. 



Theorem 3. Suppose that, as the sample size increases, the rates at which the 
models increase their number of parameters are $i = O(n)$ and $p = O(n)$, 
and there exist positive constants $r$ and $s$ such that $r = \lim_{n,p\to\infty} n/p$ and 
$s = \lim_{n,i\to\infty} n/i > 1$. 

1. When sampling from the simpler model $M_i$, $\lim_{n\to\infty}[M_i]B_{pi}(y) = 0$. 

2. When sampling from the alternative model $M_p$, there exists a function 
$\delta(r, s)$ such that 

$$\lim_{n\to\infty}[M_p]B_{pi}(y) = \begin{cases} \infty, & \text{if } \lim_{n\to\infty}\delta_{pi} > \delta(r, s), \\ 0, & \text{if } \lim_{n\to\infty}\delta_{pi} < \delta(r, s). \end{cases}$$

This function has the following simple explicit form: 

$$\delta(r, s) = \frac{r - 1}{(r + 1)^{s(r-1)/(r(s-1))} - 1} - 1 + \frac{1}{s} \tag{5}$$

and it is a bounded decreasing convex function in $r$ for fixed $s$ with 
$\delta(r, s) < 1/\log 2 - 1$ for all $s > r > 1$, and $\lim_{r\to\infty}\delta(r, s) = 0$ for all $s$. 
Further, $\lim_{s\to\infty}\delta(r, s) = \delta(r)$ of (3). 
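The two-argument boundary can be checked numerically against its stated properties. The sketch below is illustrative only: it codes expression (5), $\delta(r, s) = (r-1)/((r+1)^{s(r-1)/(r(s-1))} - 1) - 1 + 1/s$, together with expression (3), and the names `delta_bound_rs` and `delta_bound` are hypothetical.

```python
import math


def delta_bound_rs(r, s):
    """Boundary delta(r, s) of expression (5), for s > r > 1."""
    exponent = s * (r - 1) / (r * (s - 1))
    return (r - 1) / ((r + 1) ** exponent - 1) - 1 + 1 / s


def delta_bound(r):
    """Boundary delta(r) of expression (3), for r > 1."""
    return (r - 1) / ((r + 1) ** ((r - 1) / r) - 1) - 1


if __name__ == "__main__":
    upper = 1 / math.log(2) - 1
    for r, s in ((1.5, 2), (2, 3), (2, 10), (5, 6)):
        print(r, s, delta_bound_rs(r, s), delta_bound_rs(r, s) < upper)
    # Letting s grow with r fixed approaches delta(r)
    print(delta_bound_rs(2, 1e8), delta_bound(2))
```

The last line illustrates the limiting relation between the two boundaries: when the reduced model grows much more slowly than the full one, $s$ is large and the term $1/s$ together with the change in the exponent become negligible.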

Proof. To prove consistency under the simple model $M_i$, from Lemma 
1 it follows that 

$$\lim_{p\to\infty}[M_i]B_{ip} = \frac{s(r - 1)}{r(s - 1)}$$

and, replacing $n$ by $pr$, and $i = n/s = pr/s$, the Bayes factor for large $p$ can be 
approximated by 

$$B_{pi}(y) \approx \frac{2}{\pi}\int_0^{\pi/2}\left(1 + \frac{r}{\sin^2\varphi}\right)^{p(r-1)/2}\left(1 + \frac{s(r - 1)}{(s - 1)\sin^2\varphi}\right)^{-pr(s-1)/(2s)} d\varphi. \tag{6}$$

As the integrand is a monotonic increasing function of the angle $\varphi$, the 
supremum is attained at $\varphi = \pi/2$, and thus an upper bound of the integrand 
is 

$$(1 + r)^{p(r-1)/2}\left(1 + \frac{s(r - 1)}{s - 1}\right)^{-pr(s-1)/(2s)}.$$

Then, for large $p$, the Bayes factor is bounded from above by 

$$B_{pi}(y) \le \left[(1 + r)^{r-1}\left(\frac{rs - 1}{s - 1}\right)^{-r(s-1)/s}\right]^{p/2},$$

but as the function of $r$ and $s$ enclosed in square brackets is strictly smaller 
than 1 for $s > r > 1$, it follows that the limit of the upper bound of the Bayes 
factor is 0 for all $s > r > 1$, thus proving consistency of the Bayes factor for 
the intrinsic prior under the reduced model $M_i$. 

Consistency under the full model $M_p$ is proven in a similar way to that 
of Theorem 2. From Lemma 1, the limit in probability of the statistic $B_{ip}$ 
under $M_p$ is now 

$$\lim_{p\to\infty}[M_p]B_{ip} = \frac{1 - 1/r}{1 + \delta - 1/s},$$

where $\delta$ is the same as in Theorem 2. Following the same course of reasoning 
as in Theorem 2, we finally arrive at the following new approximation for 
the Bayes factor for large values of $p$: 

$$B_{pi}(y) \approx \left[(1 + r)^{r-1}\left(1 + \frac{r - 1}{1 + \delta - 1/s}\right)^{-r(s-1)/s}\right]^{p/2}.$$

As the expression in square brackets does not depend on $p$, the limiting 
behavior of the Bayes factor depends on whether this expression is less than 
or greater than 1. Therefore, the new value of the boundary for 
consistency-inconsistency, $\delta(r, s)$, is the root of the equation 

$$(1 + r)^{r-1}\left(1 + \frac{r - 1}{1 + \delta - 1/s}\right)^{-r(s-1)/s} = 1,$$

which is (5). This proves the theorem. □ 

Remark 1. For all $s > r > 1$, the function $\delta(r, s)$ is bounded by a 
number smaller than 1. Note also that if the rate of growth of $M_i$ is smaller than 
that of $M_p$, that is, $s \to \infty$, then it is easy to show that $\lim_{s\to\infty}\delta(r, s) = \delta(r)$. 

An extension of Theorem 3 to the case where models $M_i$ and $M_p$ grow at 
a slower rate than the sample size; that is, $i = O(n^a)$ and $p = O(n^b)$, where 
$0 \le a = b < 1$, can be regarded as a limiting case of the preceding theorem 
where both $r$ and $s$ go to infinity. So, we have the following corollary. 

Corollary 4. For $i = O(n^a)$ and $p = O(n^b)$ and $0 \le a \le b < 1$, the 
Bayes factor for intrinsic priors is consistent if $\lim_{n\to\infty}\delta_{pi} > 0$. 

3. Applications. We look at some practical models for which the results 
of the preceding section can be applied, including various ANOVA mod- 
els, the multiple change point problem, the clustering problem and spline 
regression. 

In particular, the classical ANOVA problem will be illustrated in some 
detail. For instance, we will see that for the one-way ANOVA, and by ex- 
tension any full factorial completely randomized design, the Bayes factor for 
intrinsic priors is inconsistent in a region around the null. However, reducing 
the ANOVA model by eliminating interaction terms recovers consistency. 

3.1. Homoscedastic ANOVA. There is a subtle difference between an 
ANOVA with a full model specification (including all interactions) and one 
with a reduced model specification, as it results in different asymptotic rates. 
We present the results for balanced models with the same number of obser- 
vations per cell, but they can easily be extended to cover the unbalanced 
case. 

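3.1.1. Full model specification. We give a detailed development for the 
one-way ANOVA, and then show how the results apply to full factorial 
designs. The null sampling model of the homoscedastic one-way ANOVA, 
$M_1$, where it is assumed that the means are equal to an unknown $\mu$, can be 
written as 

$$M_1 : \left\{N_n(y | \mu 1_n, \tau^2 I_n),\ \pi^N(\mu, \tau) = \frac{c}{\tau}\right\}$$

and the alternative model as 

$$M_p : \left\{N_n(y | X_p\beta_p, \sigma^2 I_n),\ \pi^I(\beta_p, \sigma | \mu, \tau) = N_p\bigl(\beta_p \,|\, \mu 1_p, (\sigma^2 + \tau^2)W^{-1}\bigr)\, HC^+(\sigma | 0, \tau)\right\},$$

where $c$ is an arbitrary positive constant, $1_n$ denotes a vector of $n$ 
components containing 1's, $X_p$ is an $n \times p$ matrix such that the first $r$ rows are 
equal to the unit vector $e_1$, the next $r$ rows are equal to the unit vector 
$e_2$ and so on, so that the last $r$ rows are equal to the unit vector $e_p$, 
$W^{-1} = \frac{n}{p+1}(X_p'X_p)^{-1}$ as in Section 2, and where 

$$HC^+(\sigma | 0, \tau) = \frac{2\tau}{\pi(\sigma^2 + \tau^2)}$$

represents the half Cauchy prior density of $\sigma$, conditional on $\mu, \tau$, on the 
positive part of the real line. 

Since the dimension of $M_1$ is 2 and the dimension of $M_p$ is $p + 1$, Theorem 
2 shows that there is an inconsistency region given by those alternative 
models with $\lim_{p\to\infty}\delta_{p1} < \delta(r)$ where $\delta(r)$ is given in (3). Thus, when sampling 
from $M_p$ we have that 

$$\lim_{p\to\infty}[M_p]B_{p1}(y) = \begin{cases} \infty, & \text{if } \lim_{p\to\infty}\delta_{p1} > \delta(r), \\ 0, & \text{if } \lim_{p\to\infty}\delta_{p1} < \delta(r), \end{cases}$$

where the distance $\delta_{p1}$ is given by 

$$\delta_{p1} = \frac{1}{n\sigma^2}\,\beta_p' X_p'\left(I_n - \frac{1}{n}1_n 1_n'\right)X_p\beta_p = \frac{1}{\sigma^2 p}\sum_{i=1}^p (\beta_i - \bar\beta)^2.$$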
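To make the inconsistency region concrete, the distance of a given configuration of cell means can be compared with the boundary of (3). The Python sketch below is a finite-$p$ illustration only, whereas the theorem concerns the limit as $p \to \infty$. It assumes the balanced one-way layout, in which the distance reduces to $\delta_{p1} = \sum_{i=1}^p(\beta_i - \bar\beta)^2/(\sigma^2 p)$, and the function names are hypothetical.

```python
def delta_bound(r):
    """Boundary delta(r) of expression (3), for r > 1."""
    return (r - 1) / ((r + 1) ** ((r - 1) / r) - 1) - 1


def anova_distance(means, sigma2):
    """Distance from the one-way ANOVA model with the given cell means
    to the null model of equal means, for a balanced design."""
    p = len(means)
    mean_bar = sum(means) / p
    return sum((b - mean_bar) ** 2 for b in means) / (sigma2 * p)


def in_inconsistency_region(means, sigma2, r):
    """True when the finite-p distance falls below delta(r),
    where r is the number of replicates per cell (n / p)."""
    return anova_distance(means, sigma2) < delta_bound(r)


if __name__ == "__main__":
    close_to_null = [0.0, 1.0] * 50     # distance 0.25 when sigma2 = 1
    far_from_null = [0.0, 2.0] * 50     # distance 1.0 when sigma2 = 1
    print(in_inconsistency_region(close_to_null, 1.0, r=2))
    print(in_inconsistency_region(far_from_null, 1.0, r=2))
```

With two replicates per cell the boundary is about 0.366, so the first configuration lies inside the inconsistency region and the second outside it.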



If we have a multiway completely randomized design the same results hold, 
as such a design is equivalent to a one-way design. For example, suppose we 
have a three-way full factorial with the model 

$$y_{ijk} = \mu_i + \tau_j + \gamma_k + (\mu\tau)_{ij} + (\mu\gamma)_{ik} + (\tau\gamma)_{jk} + (\mu\tau\gamma)_{ijk} + \varepsilon_{ijk}, \tag{7}$$

$$i = 1, \ldots, I,\quad j = 1, \ldots, J,\quad k = 1, \ldots, K.$$

The number of parameters (with identifiability restrictions) is $IJK$, and 
thus we are again in the case of Theorem 2 with $b = 1$. Any null hypothesis 
will result in a model $M_i$ with a reduced set of parameters that will satisfy 
$a < b$ of Theorem 2. Thus when sampling from the full model, the intrinsic 
Bayes procedure is consistent only if $\delta_{p1} > \delta(r)$, where $\delta(r)$ is given in (3), 
and, analogous to the one-way case, $\delta_{p1}$ is equal to the sum of squares of the 
differences between the null model coefficients and the full model coefficients. 
Extension to higher-order designs is straightforward. 

3.1.2. Reduced model specification. In higher-order ANOVA models, it is 
often the case that some interaction terms are not specified. In particular, if 
the highest order interaction is not in the model, we can attain consistency of 
the intrinsic Bayes factor over the entire parameter space. We illustrate this 
with the three-way model (7); the extension to higher-order models should 
be clear. 

If we eliminate the term $(\mu\tau\gamma)_{ijk}$ from the model (7), then there are at 
most 

$$p = I + J + K + IJ + IK + JK$$

parameters in the full model $M_p$. Since there are $n = rIJK$ observations, 
it immediately follows that 

$$p = \begin{cases} O(n), & \text{if } I \text{ or } J \text{ or } K \to \infty, \\ o(n), & \text{if } I \text{ and } J \text{ and } K \to \infty. \end{cases}$$

So in the first case we can apply Theorem 2, and, similar to the full model 
evaluation, there will be an inconsistency region. However, in the second 
case, when all of $I$, $J$, and $K \to \infty$ we are in the case of Corollary 4; there 
is no inconsistency region and the Bayes factor for the intrinsic priors is 
consistent in the entire parameter space. 
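The two growth regimes can be seen directly from the ratio $p/n$. The following Python sketch is a simple arithmetic illustration, assuming $p = I + J + K + IJ + IK + JK$ parameters and $n = rIJK$ observations as above; the function and argument names are hypothetical.

```python
def reduced_model_ratio(num_i, num_j, num_k, r):
    """Ratio p / n for the three-way model without the highest-order
    interaction, with r replicates per cell."""
    p = (num_i + num_j + num_k
         + num_i * num_j + num_i * num_k + num_j * num_k)
    n = r * num_i * num_j * num_k
    return p / n


if __name__ == "__main__":
    # Only one factor grows: p / n stays bounded away from zero, so p = O(n).
    for m in (10, 1000, 10**6):
        print(m, reduced_model_ratio(m, 2, 2, 2))
    # All three factors grow: p / n tends to zero, so p = o(n).
    for m in (10, 100, 1000):
        print(m, reduced_model_ratio(m, m, m, 2))
```

In the first loop the ratio approaches $5/8$, since $p = 5I + 8$ and $n = 8I$ when $J = K = r = 2$; in the second it behaves like $3/(2m)$.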
3.2. Nested regression models. Clustering, multiple change points and 
spline regression are examples of model selection problems for which the 
dimension of the alternative models grows at the same rate as the sample 
size $n$. Therefore, in the notation of the preceding sections $b = 1$, and hence 
the Schwarz approximation is inconsistent, but the Bayes factor for intrinsic 
priors is consistent except for a small region around the null model. Note 
that the null model in clustering is the one cluster model, in the multiple 
change points problem the null model is the no change model, and in spline 
regression the null model is the model that specifies no knots. 
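The failure of the Schwarz approximation when $b = 1$ can be seen from the size of its exponent. The sketch below is a heuristic illustration under two assumptions that are ours: the Schwarz approximation is taken in the usual BIC form $\log S_{pi}(y) = \frac{i-p}{2}\log n - \frac{n}{2}\log B_{ip}(y)$, and the statistic $B_{ip}$ is replaced by the deterministic value $(1 - p/n)/(1 - i/n + \delta)$ suggested by Lemma 1 under $M_p$. The function name is hypothetical.

```python
import math


def schwarz_log_rate(n, p, i, delta):
    """Approximate (1/n) * log S_pi(y) under the full model, with B_ip
    replaced by its approximate limiting value for distance delta."""
    b_ip = (1 - p / n) / (1 - i / n + delta)
    return (i - p) / (2 * n) * math.log(n) - 0.5 * math.log(b_ip)


if __name__ == "__main__":
    # b = 1 with r = 2: even a large distance is eventually overwhelmed
    # by the penalty term, which grows like (n / (2 r)) * log n.
    for n in (100, 10**4, 10**6):
        print(n, schwarz_log_rate(n, n // 2, 1, delta=50.0))
    # b = 1/2: the penalty is of smaller order than n and the rate stays positive.
    for n in (100, 10**4, 10**6):
        print(n, schwarz_log_rate(n, int(n ** 0.5), 1, delta=0.1))
```

A positive value means the approximation favours the full model at that sample size; in the first loop the sign turns negative as $n$ grows even though the data come from the full model.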

4. Comparison with previous findings. As we have seen in Section 3.1.1, 
in the homoscedastic ANOVA there is a region of inconsistency for the Bayes 
factor for intrinsic priors. This result seems to be in contradiction with the 
finding by Berger, Ghosh and Mukhopadhyay (2003) who consider the Bayes 
factor for normal priors. The models they compare are essentially 

$$M_1 : \left\{\prod_{i=1}^p\prod_{j=1}^r N(y_{ij} | 0, 1)\right\}, \qquad M_2 : \left\{\prod_{i=1}^p\prod_{j=1}^r N(y_{ij} | \mu_i, 1),\ \pi^I(\mu_p | t) = \prod_{i=1}^p N(\mu_i | 0, 2/t)\right\}, \tag{8}$$

where $\pi^I(\mu_p | t)$ is the intrinsic prior when a training sample of size $t$ is 
considered. Observe that the hyperparameter $t$ controls the degree of concentration 
of the intrinsic priors around the null, and it usually ranges from 1 to $r$ so 
as to not exceed the concentration of the likelihood of $\mu_i$ [for a discussion 
on the topic see Casella and Moreno (2009)]. 

For a given sample $y = \{y_{ij}, j = 1, \ldots, r, i = 1, \ldots, p\}$, with cell means 
$\bar y_i = \sum_{j=1}^r y_{ij}/r$, the Bayes factor 
for intrinsic priors to compare the Bayesian model $M_2$ against $M_1$ is 

$$B_{21}(y | t) = \left(\frac{t}{2r + t}\right)^{p/2}\exp\left\{\frac{r^2}{2r + t}\sum_{i=1}^p \bar y_i^2\right\}$$

and it satisfies 

$$\lim_{p\to\infty}[M_2]B_{21}(y | t) = \begin{cases} \infty, & \text{if } \lim_{p\to\infty}\frac{1}{p}\sum_{i=1}^p \mu_i^2 > R(t, r), \\ 0, & \text{if } \lim_{p\to\infty}\frac{1}{p}\sum_{i=1}^p \mu_i^2 < R(t, r), \end{cases} \tag{9}$$

where $R(t, r) = (2r + t)(2r^2)^{-1}\ln[(2r + t)t^{-1}] - r^{-1}$, $1 \le t \le r$. 

As a curiosity we mention that the function $R(t, r)$ is related to the 
function $\delta(r)$ of Theorem 2 in the following way: 

$$R(2, r) < \delta(r) < R(1, r), \qquad r > 1.$$
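This ordering can be verified numerically on a grid of values of $r$. The Python sketch below is an illustrative check only, coding $R(t, r) = (2r + t)(2r^2)^{-1}\ln[(2r + t)t^{-1}] - r^{-1}$ and $\delta(r)$ of (3) exactly as stated; the function names `r_bound` and `delta_bound` are hypothetical.

```python
import math


def r_bound(t, r):
    """Boundary R(t, r) for training sample size t and r replicates."""
    return (2 * r + t) / (2 * r ** 2) * math.log((2 * r + t) / t) - 1 / r


def delta_bound(r):
    """Boundary delta(r) of expression (3), for r > 1."""
    return (r - 1) / ((r + 1) ** ((r - 1) / r) - 1) - 1


if __name__ == "__main__":
    for r in (1.5, 2, 5, 10, 50):
        print(r, r_bound(2, r), delta_bound(r), r_bound(1, r))
    # R(t, r) shrinks as the training sample size t grows, for fixed r.
    print([r_bound(t, 2) for t in (1, 2, 10, 100)])
```

For instance, at $r = 2$ the three values are approximately 0.324, 0.366 and 0.506. The last line shows $R(t, 2)$ decreasing toward zero as $t$ increases.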
The Bayes factor for intrinsic priors is not consistent for all possible 
alternative sampling models, and thus we cannot call it a consistent model 
selector. For each $t$ and $r$, the inconsistency region in the alternative 
parametric space will be denoted as 

$$C(t, r) = \left\{\{\mu_i, i \ge 1\} : \lim_{p\to\infty}\frac{1}{p}\sum_{i=1}^p \mu_i^2 < R(t, r)\right\}. \tag{10}$$

We note that the bound $R(t, r)$ is a decreasing function in both arguments 
$t$ and $r$, and $\lim_{r\to\infty} R(t, r) = \lim_{t\to\infty} R(t, r) = 0$. 

It turns out that for some extreme priors the Bayes factor is a consistent 
model selector. We present two extreme cases: the first one where the prior 
degenerates to a point mass, and the second one for intrinsic priors with 
variances that tend to zero. 

1. Simple null versus simple alternative. As a modification of (8), suppose 
we want to choose between 

$$M_1 : \prod_{i=1}^p\prod_{j=1}^r N(y_{ij} | 0, 1) \qquad \text{and} \qquad M_2 : \prod_{i=1}^p\prod_{j=1}^r N(y_{ij} | \mu_{i0}, 1),$$

where $\{\mu_{i0}, i \ge 1\}$ is an arbitrary but specified sequence such that 
$\lim_{p\to\infty}\sum_{i=1}^p \mu_{i0}^2/p > 0$. Then, the Bayes factor $B_{21}(y)$ satisfies 
$\lim_{p\to\infty}[M_2]B_{21}(y) = \infty$; that is, the Bayes factor is consistent under the 
alternative. This simple result means that when the prior distribution on 
the alternative degenerates to a point mass, consistency of the 
corresponding Bayes factor holds. 

2. Mixture priors. The presence of uncertainty in the alternative 
models provokes the appearance of an inconsistency region $C(t, r)$. However, in 
Berger, Ghosh and Mukhopadhyay (2003), they use a continuous version of 
the intrinsic prior above and augment $M_2$ by mixing the variance $1/t$ of the 
$N(\mu_i | 0, 1/t)$ with a hyperprior density $g(t)$. (Special cases they consider are 
to take $g$ to be either gamma or beta, yielding priors that they refer to as 
Cauchy and Smooth Cauchy.) For these general mixture priors they prove 
the following theorem. 

Theorem [Berger, Ghosh and Mukhopadhyay (2003), Theorem 3.1]. For 
any prior of the form 

$$\pi_g(\mu) = \int_0^\infty \frac{t^{p/2}}{(2\pi)^{p/2}}\, e^{-(t/2)\sum_{i=1}^p \mu_i^2}\, g(t)\, dt \tag{11}$$

with $g(t)$ having support on $(0, \infty)$, the Bayes factor is consistent under $M_1$. 
Consistency under $M_2$ holds if 

$$\tau^2 = \lim_{p\to\infty}\frac{1}{p}\sum_{i=1}^p \mu_i^2 > 0. \tag{12}$$

How do we reconcile (9) and (12), an apparent paradox? To obtain 
consistency for any alternative sampling model, we need the function in (9) 
to be zero, but this only occurs when $t$ goes to infinity because $r$ is fixed. 
Since the inconsistency regions $\{C(t, r), t \ge 1\}$ form a monotone decreasing 
sequence, the limit is $C_\infty(r) = \bigcap_{t=1}^\infty C(t, r) = \{\mu_i : \lim_{p\to\infty}\frac{1}{p}\sum_i \mu_i^2 = 0\}$, a 
point that does not belong to the alternative parameter space. In the above 
theorem this is exactly what the prior $\pi_g(\mu)$ does by incorporating priors 
with variance that tends to zero. (Something similar produces the so-called 
Lindley paradox when testing that the mean of a normal is zero; as the 
variance of the normal prior goes to infinity less and less prior mass is given to 
any neighborhood of the null.) 

Certainly, if we mix values of $t$ from 1 to $r < \infty$, for instance mixing all 
the intrinsic priors, the intersection of these inconsistency regions $C_r(r) = 
\bigcap_{t=1}^r C(t, r)$ is a nonempty set in the alternative model space, and hence 
the inconsistency region does not disappear. This is also noted by Berger, 
Ghosh and Mukhopadhyay in their Theorem 3.2, which we state here using 
our notation. 

Theorem [Berger, Ghosh and Mukhopadhyay (2003), Theorem 3.2]. For 
any prior of form (11), with $g(t)$ being supported on a finite interval $[0, 1]$, 
and $r = 1$, the Bayes factor is inconsistent under $M_2$ for $0 < \tau^2 < 2\log 2 - 1$. 

We note that here $t = 2$ and $R(2, 1) = 2\log 2 - 1$. 

5. Discussion. In our previous work [Casella et al. (2009)], where we 
looked at consistency of Bayes factors for a fixed number of parameters, we 
found that both the Bayes factor for intrinsic priors and the Schwarz ap- 
proximation to a Bayes factor had the same asymptotic behavior, and both 
were consistent. In this paper we have derived the asymptotic behavior of 
the Bayes factor for intrinsic priors and the Schwarz approximation when 
the dimension of the model grows with the sample size, and we note an inter- 
esting dichotomy in their performance. The Bayes factor for intrinsic priors 
and the Schwarz approximation have very different asymptotic behavior for 
the usual case where the dimension of the full model grows at the same rate 
as the sample size with the Bayes factor for intrinsic priors clearly being the 
optimal one. 

We summarize the consistency regions of the Bayes factor for intrinsic 
priors for different values of $a$ and $b$ in Table 1, and we extract the 
following recommendations. For models with $b < 1$, the existence of very many 
parameters is not an inconvenience as far as consistency is concerned. For 
models with $b = 1$, there is a small inconsistency region around the null 
defined by the function $\delta$ that decreases rapidly as $r$ increases. It also follows 
that inconsistency is the exception for the Bayes factor for intrinsic priors, 
while the rule is consistency, and this gives credence to the Bayes factor for 
intrinsic priors as a powerful objective tool for model selection. 

Table 1 

Rate of divergence        Consistency region of $B_{pi}(y)$ 
$0 < a = b = 1$           $M_p$: $\lim_{n\to\infty}\delta_{pi} > \delta(r, s)$ 
$0 \le a < b = 1$         $M_p$: $\lim_{n\to\infty}\delta_{pi} > \delta(r)$ 
$0 \le a \le b < 1$       $M_p$: $\lim_{n\to\infty}\delta_{pi} > 0$ 

APPENDIX: PROOF OF LEMMA 1 

Part 1 follows from Theorem 1 in Casella et al. (2009). To prove part 2, 
we note that $X_n$ can be written as 

$$X_n = \frac{W_n}{V_n + W_n},$$

where $V_n \sim (1/n)\chi^2_{p-i}(n\delta_{pi})$ and $W_n \sim (1/n)\chi^2_{n-p}$. The means and variances 
of these random variables are 

$$E(V_n) = \delta_{pi} + \frac{p - i}{n}, \qquad E(W_n) = 1 - \frac{p}{n}$$

and 

$$Var(V_n) = \frac{4\delta_{pi}}{n} + \frac{2(p - i)}{n^2}, \qquad Var(W_n) = \frac{2(n - p)}{n^2}.$$

From these expressions the three cases follow: 

(i) If $a < b = 1$, then when sampling from model $M_p$ 

$$V_n \to \delta + \frac{1}{r}, \qquad W_n \to 1 - \frac{1}{r} \qquad \text{and} \qquad X_n \to \frac{1 - 1/r}{1 + \delta}.$$

(ii) If $a = b = 1$, then when sampling from model $M_p$ 

$$V_n \to \delta + \frac{1}{r} - \frac{1}{s}, \qquad W_n \to 1 - \frac{1}{r} \qquad \text{and} \qquad X_n \to \frac{1 - 1/r}{1 + \delta - 1/s}.$$

(iii) If $b < 1$, then when sampling from model $M_p$ 

$$V_n \to \delta, \qquad W_n \to 1 \qquad \text{and} \qquad X_n \to \frac{1}{1 + \delta}.$$
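The moment calculations behind the three cases can be reproduced numerically. The Python sketch below is an illustration under the stated convention that a noncentral chi-square with $k$ degrees of freedom and noncentrality $\lambda$ has mean $k + \lambda$ and variance $2(k + 2\lambda)$; the function name `lemma_moments` is hypothetical, and the ratio of means is used only as the deterministic value that $X_n$ approaches once the variances vanish.

```python
def lemma_moments(n, p, i, delta):
    """Means and variances of V_n and W_n, and the ratio
    E(W_n) / (E(V_n) + E(W_n)) that X_n approaches."""
    mean_v = delta + (p - i) / n
    mean_w = 1 - p / n
    var_v = 4 * delta / n + 2 * (p - i) / n ** 2
    var_w = 2 * (n - p) / n ** 2
    return mean_v, mean_w, var_v, var_w, mean_w / (mean_v + mean_w)


if __name__ == "__main__":
    delta = 0.5
    # Case (ii): p = n / r and i = n / s with r = 2, s = 4.
    for n in (10**2, 10**4, 10**6):
        print(n, lemma_moments(n, n // 2, n // 4, delta))
    # Case (iii): p of order n^(1/2), fixed i.
    for n in (10**2, 10**4, 10**6):
        print(n, lemma_moments(n, int(n ** 0.5), 1, delta))
```

In case (ii) the ratio equals $(1 - 1/r)/(1 + \delta - 1/s) = 0.4$ for $r = 2$, $s = 4$ and $\delta = 0.5$, while in case (iii) it approaches $1/(1 + \delta)$; in both cases the two variances are of order $1/n$.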